Goal:

  1. Provide a snapshot of the applicants.
  2. Main use: a simple graph that showcases the applicants' talents.
  3. Include important and impressive accomplishments.
  4. Automate the process.
In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# from nltk.tokenize import RegexpTokenizer
from nltk.metrics import edit_distance
# from nltk.stem import WordNetLemmatizer
# from nltk.stem.porter import PorterStemmer
# from nltk.util import ngrams
# from nltk import pos_tag
import string
import spacy
import time
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df_raw = pd.read_csv("talent.csv",encoding = "ISO-8859-1")
# Convert column names
cols = ['index_col','education','talents','goal','team','work_remotely','start_date','platforms','pick_startups','pick_teams','add_value','achievements','ideal_job','extra']
df_raw.columns = cols
# Setting index
df_raw.index_col = df_raw.index_col.str.replace(',','').astype('int64')
df_raw.set_index('index_col', inplace=True)
df_raw.head(1)
Out[2]:
education talents goal team work_remotely start_date platforms pick_startups pick_teams add_value achievements ideal_job extra
index_col
1 Santa Clara University More than technical skills, I would say am goo... My prior experience was different from what I ... Doesn't matter. If it’s an option, I don't mind working remotely. NaN NaN NaN NaN NaN NaN NaN NaN
In [3]:
print(df_raw.info())
print('-'*40)
print(df_raw.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3310 entries, 1 to 3310
Data columns (total 13 columns):
education        3306 non-null object
talents          3309 non-null object
goal             3305 non-null object
team             3292 non-null object
work_remotely    3272 non-null object
start_date       2740 non-null object
platforms        1193 non-null object
pick_startups    2487 non-null object
pick_teams       2487 non-null object
add_value        333 non-null object
achievements     297 non-null object
ideal_job        981 non-null object
extra            1463 non-null object
dtypes: object(13)
memory usage: 362.0+ KB
None
----------------------------------------
education           4
talents             1
goal                5
team               18
work_remotely      38
start_date        570
platforms        2117
pick_startups     823
pick_teams        823
add_value        2977
achievements     3013
ideal_job        2329
extra            1847
dtype: int64
In [4]:
# Drop mostly null & useless columns
df_raw = df_raw.drop(['goal','team','work_remotely','start_date','platforms','pick_startups','pick_teams','ideal_job'], axis=1)
df_raw = df_raw.drop_duplicates(keep='first')
display(df_raw.head(1))
education talents add_value achievements extra
index_col
1 Santa Clara University More than technical skills, I would say am goo... NaN NaN NaN
In [5]:
# Keep rows with index labels up to 3256 and replace missing values with empty strings
df_new = df_raw.loc[:3256]
df_new = df_new.fillna(value='')
print(df_new.shape)
df_new.head()
(3239, 5)
Out[5]:
education talents add_value achievements extra
index_col
1 Santa Clara University More than technical skills, I would say am goo...
2 University of California, Riverside In regards to my talents, I have a strong bac...
3 University of California, Berkeley I have exceptional time management skills. As ...
4 San Jose State University I am goo with Back-end development and Machine...
5 University of California, San Diego I'm talented at self-control. I would finish m...

Remove punctuation

In [6]:
str_punc_list = list(string.punctuation) + ['。',',','(',')',':',';']
def remove_punctuation_list(text):
    text = text.lower()
    no_punct = "".join([c if c not in str_punc_list else ' ' for c in text])
    return no_punct
df_new['education'] = df_new.education.apply(remove_punctuation_list)
df_new.head()
Out[6]:
education talents add_value achievements extra
index_col
1 santa clara university More than technical skills, I would say am goo...
2 university of california riverside In regards to my talents, I have a strong bac...
3 university of california berkeley I have exceptional time management skills. As ...
4 san jose state university I am goo with Back-end development and Machine...
5 university of california san diego I'm talented at self-control. I would finish m...

Tokenize: break each string into a list of words

In [7]:
df_new['education'] = df_new.education.apply(word_tokenize)
df_new.head()
Out[7]:
education talents add_value achievements extra
index_col
1 [santa, clara, university] More than technical skills, I would say am goo...
2 [university, of, california, riverside] In regards to my talents, I have a strong bac...
3 [university, of, california, berkeley] I have exceptional time management skills. As ...
4 [san, jose, state, university] I am goo with Back-end development and Machine...
5 [university, of, california, san, diego] I'm talented at self-control. I would finish m...

Degrees are searched for in two ways:

  1. Check whether a known keyword appears in the tokenized string.
    There is no simple way to search for every degree, so common degree names are hardcoded.
  2. Many people misspell words like "university", "bachelor", and "institute".
    For these words, the edit_distance function in nltk is used to find tokens that are likely typos of them.
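
The typo check in step 2 rests on Levenshtein distance, which is what nltk's edit_distance computes by default. The following is a minimal pure-Python sketch of that idea, not the notebook's implementation. The helper names `levenshtein` and `is_likely_typo` are hypothetical, and the default threshold mirrors the `< 4` cutoff used in find_degree.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_likely_typo(word, targets, max_distance=4):
    """True if word is within max_distance - 1 edits of any target word."""
    return any(levenshtein(word, t) < max_distance for t in targets)


# Same reference words as ba_uni_syn_list in the notebook
targets = ['bachelor', 'university', 'institute', 'polytechnic']

print(levenshtein('univeristy', 'university'))
print(is_likely_typo('bachelors', targets))
print(is_likely_typo('science', targets))
```

A fixed cutoff is simple but coarse: the same threshold applies to an 8-letter word and an 11-letter word, so a cutoff that scales with word length may be worth trying if false matches show up.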
In [8]:
bachelor_list = ['b','ba','bs', 'bba', 'basc', 'bas', 'bse', 'bsc',
                 'bsba', 'bsb', 'ibs', 'be', 'bfa', 'btech', 'bcom']

university_list = ['undergraduate','virginia', 'college','pomona',
                   'rutger', 'rutgers','uc', 'ucla', 'ucsd',
                   'uconn', 'uchicago', 'ucsb', 'ucl', 'uci','uiuc','byu']

ba_uni_syn_list = ['bachelor', 'university', 'institute', 'polytechnic']


master_list = ['m','ma','ms', 'msc', 'mscs', 'mse', 'msi', 'msf', 'msit', 
               'msim', 'mba', 'msba', 'msis', 'msu', 'msmis', 'msm',
               'master', 'masters','micromaster', 'micromasters']
mba_list = ['mba']

phd_list = ['phd','ph','doctor']

ba_uni_list = bachelor_list + university_list

# 1 - bachelor
# 2 - master
# 3 - mba
# 4 - phd

degree_dict = { 4 : phd_list, 
                3 : mba_list,
                2 : master_list,
                1 : ba_uni_list}

# degree_syn_dict = { 2 : ['master'],
#                     1 : ba_uni_syn_list}

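def find_degree(word_list):
    # Exact keyword match. degree_dict is ordered from the highest degree (4)
    # to the lowest (1), so the highest matching degree wins.
    for (key, value) in degree_dict.items():
        for v in value:
            if v in word_list:
                return key
    # Fallback: fuzzy match against commonly misspelled words (treated as bachelor).
    for i in ba_uni_syn_list:
        for w in word_list:
            if edit_distance(i, w) < 4:
                return 1
    # No degree found.
    return 0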

df_new['degree'] = df_new.education.apply(find_degree)
In [9]:
df_new.pivot_table(index='degree', values='education', aggfunc='count')
Out[9]:
education
degree
0 79
1 2007
2 1090
3 34
4 29

Finding Skills:

Preprocessing:

  1. Remove punctuation
  2. Tokenize string
  3. Remove stopwords
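
The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's pipeline: `preprocess` and `STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for nltk's English list, a whitespace split stands in for word_tokenize, and lemmatization is left out.

```python
import string

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses nltk's full English stopword list.
STOPWORDS = {'i', 'am', 'a', 'the', 'and', 'with', 'at', 'in', 'of', 'to'}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({c: ' ' for c in string.punctuation})


def preprocess(text):
    # 1. Lowercase and replace punctuation with spaces.
    cleaned = text.lower().translate(PUNCT_TABLE)
    # 2. Tokenize (whitespace split stands in for nltk's word_tokenize).
    tokens = cleaned.split()
    # 3. Drop stopwords.
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("I am good with Back-end development, and Machine Learning!"))
```

Because punctuation is replaced with spaces, a hyphenated term such as "Back-end" comes out as the two tokens "back" and "end", which matters when matching multi-word skill phrases later.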

Search for skills:

Again, some common skills required for each job are hardcoded.
A very rough scoring function counts keyword matches per job and assigns each person to the job with the highest count.
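
As a rough sketch of this scoring idea, assuming shortened keyword lists: single-word skills are matched against the token list, multi-word skills against the joined text, and the job with the highest count wins. The names `SINGLE_WORD_SKILLS`, `PHRASE_SKILLS`, `score_jobs`, and `best_job` are hypothetical, and the input tokens are a toy example rather than real applicant data.

```python
# Hypothetical, shortened keyword lists; the notebook's lists are longer.
SINGLE_WORD_SKILLS = {
    'Data Scientist': ['sql', 'python', 'statistics'],
    'Web Developer': ['html', 'css', 'javascript'],
}
PHRASE_SKILLS = {
    'Data Scientist': ['machine learning'],
    'Web Developer': ['web design'],
}


def score_jobs(tokens):
    """Count keyword hits per job for one applicant's token list."""
    scores = {job: 0 for job in SINGLE_WORD_SKILLS}
    joined = ' '.join(tokens)
    for job, words in SINGLE_WORD_SKILLS.items():
        # Single words must appear as whole tokens.
        scores[job] += sum(1 for w in words if w in tokens)
    for job, phrases in PHRASE_SKILLS.items():
        # Phrases are matched as substrings of the joined text.
        scores[job] += sum(1 for p in phrases if p in joined)
    return scores


def best_job(scores):
    """Return the top-scoring job, or 'Not specified' when nothing matched."""
    job, top = max(scores.items(), key=lambda kv: kv[1])
    return job if top > 0 else 'Not specified'


tokens = ['python', 'sql', 'machine', 'learning', 'html']
scores = score_jobs(tokens)
print(scores, best_job(scores))
```

Ties go to whichever job comes first in the dictionary, which is also how argmax behaves on an array. The phrase check is a substring match, so a short phrase also matches inside longer text.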

In [10]:
#load the spacy module
nlp = spacy.load('en_core_web_sm')

#list of stopwords in english
stopwords_list = stopwords.words('english') + ['-PRON-']

def remove_stopwords(text):
    no_stopwords = [w for w in text if w not in stopwords_list]
    return no_stopwords

#use spacy to lemmatize each word
def lemma(text):
    text = remove_punctuation_list(text.lower())
    doc = nlp(text)
    doc_lemma = " ".join(token.lemma_ for token in doc)
    word_list = word_tokenize(doc_lemma)
    word_list = remove_stopwords(word_list)
    return word_list
In [11]:
df_new.head()
Out[11]:
education talents add_value achievements extra degree
index_col
1 [santa, clara, university] More than technical skills, I would say am goo... 1
2 [university, of, california, riverside] In regards to my talents, I have a strong bac... 1
3 [university, of, california, berkeley] I have exceptional time management skills. As ... 1
4 [san, jose, state, university] I am goo with Back-end development and Machine... 1
5 [university, of, california, san, diego] I'm talented at self-control. I would finish m... 1
In [12]:
for col in ['talents', 'add_value', 'achievements', 'extra']:
    df_new[col] = df_new[col].apply(lemma)
In [32]:
df_new
Out[32]:
education talents add_value achievements extra degree skill_score job_id deg_name job_name
index_col
1 ['santa', 'clara', 'university'] ['technical', 'skill', 'would', 'say', 'good',... [] [] [] 1 [0. 0. 0. 0.] 0 Bachelor Not specified
2 ['university', 'of', 'california', 'riverside'] ['regard', 'talent', 'strong', 'background', '... [] [] [] 1 [0. 0. 0. 0.] 0 Bachelor Not specified
3 ['university', 'of', 'california', 'berkeley'] ['exceptional', 'time', 'management', 'skill',... [] [] [] 1 [0. 0. 0. 0.] 0 Bachelor Not specified
4 ['san', 'jose', 'state', 'university'] ['goo', 'back', 'end', 'development', 'machine... [] [] [] 1 [1. 0. 0. 0.] 1 Bachelor Data Scientist
5 ['university', 'of', 'california', 'san', 'die... ['talente', 'self', 'control', 'would', 'finis... [] [] [] 1 [0. 0. 0. 0.] 0 Bachelor Not specified
... ... ... ... ... ... ... ... ... ... ...
3252 ['university', 'of', 'chicago', 'phillips', 'a... ['ever', 'since', 'discover', 'natural', 'tale... ['conduct', 'statistical', 'analysis', 'profas... ['madame', 'sarah', 'abbot', 'award', 'award',... [] 1 [3. 0. 0. 3.] 1 Bachelor Data Scientist
3253 ['university', 'of', 'central', 'florida', 'sa... ['excellent', 'communicator', 'think', 'love',... ['want', 'analyze', 'datum', 'utilize', 'measu... ['team', 'player', 'like', 'get', 'work', 'sak... ['win', 'need', 'sponsorship', 'green', 'card'... 1 [11. 2. 0. 0.] 1 Bachelor Data Scientist
3254 ['allentown', 'central', 'catholic', 'high', '... ['communication', 'skill', 'teamwork', 'hardwo... ['always', 'ready', 'learn', 'new', 'skill', '... ['graduate', 'high', 'school', 'honor', 'curre... ['great', 'sense', 'humor', 'put', '100', 'eff... 0 [0. 0. 0. 0.] 0 NaN Not specified
3255 ['uc', 'santa', 'cruz', 'cognitive', 'science'... ['think', 'main', 'talent', 'detail', 'orient'... ['design', 'wireframe', 'app', 'would', 'devel... ['able', 'obtain', 'dean', 'honor', 'take', 'g... [] 1 [0. 0. 0. 3.] 4 Bachelor Designer
3256 ['georgia', 'gwinnett', 'college', 'bachelors'... ['self', 'motivated', 'learner', 'teach', 'cod... ['structure', 'database', 'experieice', 'use',... ['teach', 'python', 'javascript', 'html', 'css... [] 1 [3. 2. 8. 1.] 3 Bachelor Web Developer

3239 rows × 10 columns

In [14]:
data_skill = ['sql','python','r','tableau','sas','spark','scala','database',
              'ml','scikit','regression','forest','classify','statistics','statistical','visualization',
              'analysis','analytic','mine','predictive','prescriptive','nlp']
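# Note: 'c++' can never match, because remove_punctuation_list replaces '+' with a space before tokenizing.
software_skill = ['python','java','c++','linux','c','oracle','software','algorithm']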
web_skill = ['html','css','javascript','php']
design_skill = ['ui','ux','uiux','design','adobe','photoshop','illustrator','ps']

# 1 - Data Scientist
# 2 - Software Engineer
# 3 - Web Developer
# 4 - Designer
skill_dict = {1: data_skill,
              2: software_skill,
              3: web_skill,
             4: design_skill}


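# Note: hyphenated phrases such as 'back-end' and 'front-end' can never match,
# because hyphens are replaced with spaces during preprocessing.
long_skill_dict = {1: ['machine learning','deep learning','data mining','data modeling','data analytic','business analytic',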
                       'data scientist','business intelligence','data cleaning','natural language processing'],
                   2: ['develop app','computer science','computer engineering','web services','data structure',
                      'software engineer','software development','software engineering','software developer'],
                   3: ['back-end','front-end','web app','mobile app','webpage design','website design','information architecture',
                      'web programming','web developer','web application','web design','web applications','information technology'],
                  4:['user interface','user experience','logo design','graphic design']}


def find_skill_score(text):
    skill_score = np.zeros(4)
    for (key, val) in skill_dict.items():
        for v in val:
            if v in text:
                skill_score[key-1]+=1
    for (key, val) in long_skill_dict.items():
        for v in val:
            if v in " ".join(text):
                skill_score[key-1]+=1      
    return skill_score

def find_max(arr):
    if (arr == np.zeros(4)).all():
        return 0
    else:
        return arr.argmax() + 1
In [15]:
df_talent_score = df_new['talents'].apply(find_skill_score)
df_value_score = df_new['add_value'].apply(find_skill_score)
df_extra_score = df_new['extra'].apply(find_skill_score)
df_new['skill_score'] = df_talent_score + df_value_score + df_extra_score
In [16]:
df_new['job_id'] = df_new.skill_score.apply(find_max)
In [17]:
df_new.pivot_table(index='job_id',values='education',aggfunc='count')
Out[17]:
education
job_id
0 1173
1 1374
2 408
3 99
4 185
In [18]:
deg_dict = {1 : 'Bachelor', 2 : 'Master', 3 : 'MBA', 4: 'Ph.D.'}
job_dict = {1 : 'Data Scientist', 2 : 'Software Engineer', 3 : 'Web Developer', 4: 'Designer', 0:'Not specified'}
In [19]:
df_new['deg_name'] = df_new.degree.map(deg_dict)
df_new['job_name'] = df_new.job_id.map(job_dict)
df_new.head()
Out[19]:
education talents add_value achievements extra degree skill_score job_id deg_name job_name
index_col
1 [santa, clara, university] [technical, skill, would, say, good, prioritiz... [] [] [] 1 [0.0, 0.0, 0.0, 0.0] 0 Bachelor Not specified
2 [university, of, california, riverside] [regard, talent, strong, background, project, ... [] [] [] 1 [0.0, 0.0, 0.0, 0.0] 0 Bachelor Not specified
3 [university, of, california, berkeley] [exceptional, time, management, skill, student... [] [] [] 1 [0.0, 0.0, 0.0, 0.0] 0 Bachelor Not specified
4 [san, jose, state, university] [goo, back, end, development, machine, learn, ... [] [] [] 1 [1.0, 0.0, 0.0, 0.0] 1 Bachelor Data Scientist
5 [university, of, california, san, diego] [talente, self, control, would, finish, work, ... [] [] [] 1 [0.0, 0.0, 0.0, 0.0] 0 Bachelor Not specified
In [20]:
df_new.to_csv('talent_deg_job.csv')
In [21]:
df_new = pd.read_csv('talent_deg_job.csv', index_col=0)

Visualization

With Plotly:

In [22]:
import plotly.graph_objects as go
In [23]:
deg_count = df_new.degree.value_counts()
ba_count = deg_count[1]
ma_count = deg_count[2]
mba_count = deg_count[3]
phd_count = deg_count[4]
deg_tot = deg_count.sum()

job_count = df_new.job_id.value_counts()
ds_count = job_count[1]
sde_count = job_count[2]
wd_count = job_count[3]
d_count = job_count[4]
job_tot = job_count.sum()
In [24]:
import plotly.offline as pyo
pyo.init_notebook_mode()
In [25]:
fig = go.Figure()

fig.add_trace(go.Indicator(
    mode = "number",
    value = ba_count,
    number = {'font': {'color': 'blue'}},
    title = {"text": "Bachelor's Degree"},
    domain = {'x': [0, 1/3], 'y': [0.66, 0.86]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = ma_count,
    number = {'font': {'color': 'blue'}},
    title = {"text": "Master's Degree"},
    domain = {'x': [1/3, 2/3], 'y': [0.66, 0.86]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = phd_count,
    number = {'font': {'color': 'blue'}},
    title = {"text": "Ph.D."},
    domain = {'x': [2/3, 1], 'y': [0.66, 0.86]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = ds_count,
    number = {'font': {'color': 'orange'}},
    title = {"text": "Data Scientist"},
    domain = {'x': [0, 1/2], 'y': [0.33, 0.53]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = sde_count,
    number = {'font': {'color': 'orange'}},
    title = {"text": "Software Engineer"},
    domain = {'x': [1/2, 1], 'y': [0.33, 0.53]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = wd_count,
    number = {'font': {'color': 'orange'}},
    title = {"text": "Web Developer"},
    domain = {'x': [0, 1/2], 'y': [0, 0.2]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = d_count,
    number = {'font': {'color': 'orange'}},
    title = {"text": "Designer"},
    domain = {'x': [1/2, 1], 'y': [0, 0.2]}))
# fig.update_layout(paper_bgcolor = "lightgray")

fig.show()
In [26]:
# set_palette applies the palette to later plots; color_palette only returns it
sns.set_palette('colorblind')

fig = plt.figure(figsize=(20,6))

ax1 = fig.add_subplot(1,2,1)
fig1 = sns.countplot(x='deg_name', data=df_new, ax=ax1)
fig1.set(xlabel='Degree Name', ylabel='Number of People with Each Degree', title='Degree Distribution')
for f in fig1.patches:
    h = f.get_height()
    fig1.text(f.get_x() + f.get_width()/2., h+20, h ,ha="center")

ax2 = fig.add_subplot(1,2,2)
fig2 = sns.countplot(x='job_name', data=df_new, ax=ax2)
fig2.set(xlabel='Skill Name', ylabel='Number of People with Each Skill', title='Skill Distribution')
for f in fig2.patches:
    h = f.get_height()
    fig2.text(f.get_x() + f.get_width()/2., h+20, h ,ha="center")
;
Out[26]:
''

Are the following data misplaced?

3 hackathon winners

In [27]:
df_new.loc[3257:,:]
Out[27]:
education talents add_value achievements extra degree skill_score job_id deg_name job_name
index_col
In [28]:
# pandas, numpy, and plotly.graph_objects are already imported above
import holoviews as hv
In [29]:
hv.extension('bokeh')
In [30]:
df_grouped = df_new.groupby(by=["deg_name","job_name"]).size().to_frame('size')
df_grouped = df_grouped.reset_index()
print(df_grouped)
    deg_name           job_name  size
0   Bachelor     Data Scientist   649
1   Bachelor           Designer   135
2   Bachelor      Not specified   854
3   Bachelor  Software Engineer   300
4   Bachelor      Web Developer    69
5        MBA     Data Scientist    16
6        MBA           Designer     1
7        MBA      Not specified    13
8        MBA  Software Engineer     4
9     Master     Data Scientist   671
10    Master           Designer    40
11    Master      Not specified   260
12    Master  Software Engineer    91
13    Master      Web Developer    28
14     Ph.D.     Data Scientist    16
15     Ph.D.      Not specified    10
16     Ph.D.  Software Engineer     3
In [31]:
plot1 = hv.Sankey(df_grouped)
plot1.opts(cmap='Colorblind',label_position='left',
                                 edge_color='job_name', edge_line_width=0,
                                 node_alpha=1.0, node_width=40, node_sort=True,
                                 width=800, height=600, bgcolor="snow",
                                 title="Forkaia Talent Snapshot")
Out[31]: